Search for: All records

Creators/Authors contains: "Parhi, Keshab K"


RECAPHE: REconfigurable Polynomial Modular Computation Architectures for Unified PQC and HE Schemes

Wang, Antian; Zhang, Kaiyuan; Parhi, Keshab K; Lao, Yingjie (December 2025, Asilomar)

Full Text Available
Encoder Circuit Optimization for Nonbinary Quantum Error Correction Codes in Prime Dimensions: An Algorithmic Framework

https://doi.org/10.1109/TQE.2026.3669054

Sodhani, Aditya; Parhi, Keshab K (March 2026, IEEE Transactions on Quantum Engineering)

Quantum computers are a revolutionary class of computational platforms with applications in combinatorial and global optimization, machine learning, and other domains involving computationally hard problems. While these machines typically operate on qubits (quantum information elements that can occupy superpositions of the basis |0⟩ and |1⟩ states), recent advances have demonstrated the practical implementation of higher dimensional quantum systems (qudits) across various hardware platforms. In these hardware realizations, the higher order states are less stable, and thus remain coherent for a shorter duration than the basis |0⟩ and |1⟩ states. Moreover, formal methods for designing efficient encoder circuits for these systems remain underexplored. This limitation motivates the development of efficient circuit techniques for qudit systems (d-level quantum systems). Previous works have typically established generating gate sets for higher dimensional codes by generalizing the methods used for qubits. In this work, we introduce a systematic framework for optimizing encoder circuits for prime-dimension stabilizer codes. This framework is based on novel generating gate sets whose elements map directly to efficient Clifford gate sequences. We demonstrate the effectiveness of this method on key codes, achieving a 13%–44% reduction in encoder circuit gate count for the qutrit (d = 3) [[9,5,3]]_3, [[5,1,3]]_3, and [[7,1,3]]_3 codes, and a 9%–21% reduction for the ququint (d = 5) [[10,6,3]]_5 code when compared to prior work. We also achieved circuit depth reductions up to 42%.
Full Text Available
The Equivalence of Fast Algorithms for Convolution, Parallel FIR Filters, Polynomial Modular Multiplication, and Pointwise Multiplication in DFT/NTT Domain

https://doi.org/10.1109/IEEECONF67917.2025.11443433

Parhi, Keshab K (October 2025, IEEE)

Fast time-domain algorithms have been developed in signal processing applications to reduce the multiplication complexity. For example, fast convolution structures using Cook-Toom and Winograd algorithms are well understood. Short-length fast convolutions can be iterated to obtain fast convolution structures for long lengths. In this paper, we show that well-known fast convolution structures form the basis for the design of fast algorithms in four other problem domains: fast parallel filters, fast polynomial modular multiplication, and fast pointwise multiplication in the DFT and NTT domains. Fast polynomial modular multiplication and fast pointwise multiplication problems are important for cryptosystem applications such as post-quantum cryptography and homomorphic encryption. By establishing the equivalence of these problems, we show that a fast structure from one domain can be used to design a fast structure for another domain. This understanding is important as there are many well-known solutions for fast convolution that can be used in other signal processing and cryptosystem applications.
Full Text Available
Efficient Decomposition of Multistage Composite Length FFT for Complex and Real Signals

https://doi.org/10.1109/TSP.2026.3664785

Chiu, Sin-Wei; Parhi, Keshab K (February 2026, IEEE Transactions on Signal Processing)

Fast Fourier Transform (FFT) algorithms play a fundamental role in modern digital signal processing (DSP). A composite length FFT decomposes a longer-length FFT into multiple shorter-length FFTs, enabling a bottom-up construction approach. If the input signal is complex-valued (real-valued), the DFT computation is referred to as CFFT (RFFT). If the input is real-valued, redundancies and symmetries in the RFFT can be exploited to design more efficient algorithms. In this work, we extend the concept of two-stage composite FFTs to multistage composite FFTs. In a multistage FFT, the decomposition and parenthesization of sub-blocks significantly affect the structure’s efficiency. Although the set of sub-blocks remains the same, different parenthesizations lead to varying numbers of twiddle factor multiplications in the interconnect stages. We adopt radix-2 FFTs as the fundamental building blocks and apply dynamic programming to determine the optimal decomposition and parenthesization for arbitrary power-of-two FFT lengths. Our work presents optimal strategies for both complex-valued and real-valued FFTs. Compared to the radix-2 FFT, our multistage FFT reduces the number of complex multiplications by approximately 20% for complex-valued inputs and 30% for real-valued inputs. These results demonstrate that the multistage FFT approach not only simplifies the design process but also improves computational efficiency by significantly reducing the number of complex multiplications. Furthermore, the composite multistage structure enables an efficient bottom-up design methodology, requiring only two types of sub-blocks for CFFT and four types for RFFT implementations. Designers can follow the proposed straightforward design strategies, which guarantee the construction of optimal CFFT and RFFT structures.
Full Text Available
Architectures for Serial and Parallel Pipelined NTT-Based Polynomial Modular Multiplication

https://doi.org/10.1109/TVLSI.2025.3576782

Chiu, Sin-Wei; Parhi, Keshab K (June 2025, IEEE Transactions on Very Large Scale Integration (VLSI) Systems)

Quantum computers pose a significant threat to modern cryptographic systems by efficiently solving problems such as integer factorization through Shor’s algorithm. Homomorphic encryption (HE) schemes based on ring learning with errors (Ring-LWE) offer a quantum-resistant framework for secure computations on encrypted data. Many of these schemes rely on polynomial multiplication, which can be efficiently accelerated using the number theoretic transform (NTT) in leveled HE, ensuring practical performance for privacy-preserving applications. This article presents a novel NTT-based serial pipelined multiplier that achieves full-hardware utilization through interleaved folding, and overcomes the 50% under-utilization limitation of the conventional serial R2MDC architecture. In addition, it explores tradeoffs in pipelined parallel designs, including serial, 2-parallel, and 4-parallel architectures. Our designs leverage increased parallelism, efficient folding techniques, and optimizations for a selected constant modulus to achieve superior throughput (TP) compared with state-of-the-art implementations. While the serial fold design minimizes area consumption, the 4-parallel design maximizes TP. Experimental results on the Virtex-7 platform demonstrate that our architectures achieve at least 2.22 times higher TP/area for a polynomial length of 1024 and 1.84 times for a polynomial length of 4096 in the serial fold design, while the 4-parallel design achieves at least 2.78 times and 2.79 times, respectively. The efficiency gain is even more pronounced in TP squared over area, where the serial fold and 4-parallel designs outperform prior works by at least 4.98 times and 26.43 times for a polynomial length of 1024 and 6.7 times and 43.77 times for a polynomial length of 4096, respectively. These results highlight the effectiveness of our architectures in balancing performance, area efficiency, and flexibility, making them well-suited for high-speed cryptographic applications.
Full Text Available
On Computing Linear, Positive-Wrapped (Circular), and Negative-Wrapped Convolutions in the Frequency Domain [Tips & Tricks]

https://doi.org/10.1109/MSP.2025.3544185

Chiu, Sin-Wei; Parhi, Keshab K (May 2025, IEEE Signal Processing Magazine)

Convolution is a fundamental operation with diverse applications in signal processing, computer vision, and machine learning. This article reviews three distinct convolutions: linear convolution (also referred to as aperiodic convolution), positive-wrapped convolution (PWC) (also known as circular convolution), and negative-wrapped convolution (NWC). Additionally, we propose an alternative approach to computing linear convolution without zero padding by leveraging the PWC and NWC. We compare two fast Fourier transform (FFT)-based methods to compute linear convolution: the traditional zero-padded PWC method and a new method based on the PWC and NWC. Through a detailed analysis of the flowgraphs (FGs), we demonstrate the equivalence of these methods while highlighting their unique characteristics. We show that computing the NWC using the weighted PWC method is equivalent to a part of the linear convolution computation with zero padding. Furthermore, it is possible to extract the PWC and NWC from structures to compute linear convolution with zero padding, where the last butterfly stage can be eliminated. This article aims to establish a clear connection among PWC, NWC, and linear convolution, illustrating new perspectives on computing different convolutions.
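As a rough illustration of the relationship described in this abstract, the following Python sketch recovers a linear convolution from a PWC and an NWC of the same length, with no zero padding. It is not the authors' flowgraph: the function names are hypothetical, a naive O(N^2) DFT stands in for an FFT, and the NWC is computed by weighting the inputs with powers of exp(jπ/N) before a PWC, which is one standard reading of "weighted PWC".

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) DFT (stand-in for an FFT; assumption for brevity)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
               for k in range(n)) for m in range(n)]
    return [v / n for v in out] if inverse else out

def pwc(a, b):
    """Positive-wrapped (circular) convolution via pointwise products."""
    prod = [x * y for x, y in zip(dft(a), dft(b))]
    return dft(prod, inverse=True)

def nwc(a, b):
    """Negative-wrapped convolution via a weighted PWC."""
    n = len(a)
    psi = [cmath.exp(1j * cmath.pi * k / n) for k in range(n)]
    aw = [x * w for x, w in zip(a, psi)]
    bw = [x * w for x, w in zip(b, psi)]
    return [c / w for c, w in zip(pwc(aw, bw), psi)]

def linear_from_pwc_nwc(a, b):
    """Linear convolution of two length-N sequences without zero padding.

    With y the linear convolution, PWC[n] = y[n] + y[n+N] and
    NWC[n] = y[n] - y[n+N], so both halves of y follow by sum/difference.
    """
    p, q = pwc(a, b), nwc(a, b)
    low = [(pn + qn) / 2 for pn, qn in zip(p, q)]
    high = [(pn - qn) / 2 for pn, qn in zip(p, q)]
    y = low + high
    return [round(v.real) for v in y[:2 * len(a) - 1]]

def linear_direct(a, b):
    """Schoolbook linear convolution, used only as a reference."""
    y = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y

result = linear_from_pwc_nwc([1, 2, 3, 4], [5, 6, 7, 8])
```

The rounding step assumes integer inputs; for real-valued data one would keep the real part without rounding.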
Full Text Available
LayerPipe2: Multistage Pipelining and Weight Recompute via Improved Exponential Moving Average for Training Neural Networks

https://doi.org/10.1109/IEEECONF67917.2025.11443520

Unnikrishnan, Nanda K; Parhi, Keshab K (October 2025, IEEE)

In our prior work, LayerPipe, we introduced an approach to accelerate training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. However, despite empirical success, a principled understanding of how much gradient delay needs to be introduced at each layer to achieve a desired level of pipelining was not addressed. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed-gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure: inner layers require fewer delays, while outer layers require longer delays. When pipelining is applied at every layer, each delay depends only on the number of remaining downstream stages; when layers are pipelined in groups, all layers in the group share the same assignment. These insights not only explain previously observed scheduling patterns but also expose an often-overlooked challenge: pipelining implicitly requires storage of historical weights. We overcome this storage bottleneck by developing a pipeline-aware moving average that reconstructs the required past states rather than storing them explicitly. This reduces memory cost without sacrificing the accuracy guarantees that make pipelined learning viable. The result is a principled framework that illustrates how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication–computation tradeoffs.
Full Text Available
Graph Convolution Network Based Classification of Subjects with Prefrontal Cortex Lesion via Information-theoretic Brain Network Features

https://doi.org/10.1007/s11265-025-01944-z

Balaji, Sai Sanjay; Parhi, Keshab K (January 2025, Journal of Signal Processing Systems)

This paper investigates scalp electroencephalogram (EEG) data from 14 subjects with unilateral prefrontal cortex (pFC) lesions and 20 healthy controls during lateral visuospatial working memory (WM) tasks. The goal is to differentiate the brain networks involved in WM processing between these groups. The EEG recordings are transformed into graph signals, with proximity-weighted brain connectivity measures as edges and centrality measures as nodal features. Graph convolutional network (GCN) layers are used for feature representation, followed by a fully connected layer for classification. The GCN-based model effectively handles nine classification tasks, proving that graph-based network representation is versatile for describing brain interactions. The sparse MI-GCI-based graph model’s accuracy effectively captures the functional segregation of distinct WM tasks. The classifier using mutual information-guided Granger causality index (MI-GCI) with 20% of top edges matched prior classification performance with 67% fewer parameters and 80% less graph density, identifying the correct class of all 34 subjects in group identification using leave-one-out cross-validation and two-thirds majority voting.
Full Text Available
Low-Complexity NTT and INTT Structures via Twiddle Shifting

https://doi.org/10.1109/MWSCAS53549.2025.11244551

Chiu, Sin-Wei; Parhi, Keshab K (August 2025, IEEE Midwest Symposium on Circuits and Systems)

Polynomial modular multiplication is an important operation used in post-quantum cryptography and homomorphic encryption, which are based on ring learning with errors (RLWE) problems. For long polynomial lengths, this operation can be efficiently computed using number theoretic transform (NTT) and inverse NTT (INTT). In particular, negative wrapped convolution (NWC) has been proposed to compute this operation where zero padding is eliminated. Low-complexity structures for NTT (LC-NTT) and INTT (LC-INTT) have been derived in prior work by using a divide-and-conquer approach. This paper presents an alternate derivation of the LC-NTT and LC-INTT structures from traditional NTT and INTT structures. Specifically, we show that using twiddle factor pushing (pulling) from left to right (right to left), we can derive the prior LC-NTT (LC-INTT) structures. We present systematic algorithms for twiddle factor pushing and pulling to derive the equivalent architectures. The alternate approach may provide opportunities for optimizing hardware implementations of polynomial modular multiplication.
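For readers unfamiliar with the operation this abstract refers to, here is a minimal Python sketch of polynomial multiplication modulo (x^n + 1, q) using an NTT-based negative wrapped convolution. It shows the conventional weighted formulation only, not the LC-NTT/LC-INTT structures or the twiddle pushing and pulling algorithms of the paper. The function names, the toy parameters n = 4, q = 17, and psi = 2 (a primitive 2n-th root of unity modulo 17) are assumptions chosen for illustration, and the transform is a naive O(n^2) evaluation rather than a butterfly network.

```python
def ntt_nwc_multiply(a, b, q, psi):
    """Multiply polynomials a and b modulo (x^n + 1, q) without zero padding.

    psi must be a primitive 2n-th root of unity mod q (psi^n == -1 mod q).
    """
    n = len(a)
    omega = psi * psi % q            # primitive n-th root of unity
    psi_inv = pow(psi, -1, q)        # modular inverses (Python 3.8+)
    omega_inv = pow(omega, -1, q)
    n_inv = pow(n, -1, q)

    def transform(x, root):
        # Naive length-n NTT: X[m] = sum_k x[k] * root^(k*m) mod q
        return [sum(x[k] * pow(root, k * m, q) for k in range(n)) % q
                for m in range(n)]

    # Pre-weight inputs by psi^k so the circular wrap-around becomes negative.
    aw = [a[k] * pow(psi, k, q) % q for k in range(n)]
    bw = [b[k] * pow(psi, k, q) % q for k in range(n)]

    # Pointwise multiplication in the NTT domain.
    prod = [x * y % q for x, y in zip(transform(aw, omega), transform(bw, omega))]

    # Inverse NTT, then undo the weighting with psi^(-k).
    cw = [v * n_inv % q for v in transform(prod, omega_inv)]
    return [cw[k] * pow(psi_inv, k, q) % q for k in range(n)]

def schoolbook_nwc(a, b, q):
    """Reference: schoolbook product reduced with x^n = -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * b[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * b[j]) % q
    return c

product = ntt_nwc_multiply([1, 2, 3, 4], [5, 6, 7, 8], q=17, psi=2)
```

Practical parameter sets use much longer polynomials and larger moduli; the small values here only make the arithmetic easy to check by hand.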
Full Text Available
Architectural Tradeoffs for Long Polynomial Modular Multiplication

https://doi.org/10.1109/IEEECONF60004.2024.10942651

Chiu, Sin-Wei; Parhi, Keshab K (October 2024, IEEE)

Polynomial multiplication over the quotient ring is a critical operation in Ring Learning with Errors (Ring-LWE) based cryptosystems that are used for post-quantum cryptography and homomorphic encryption. This operation can be efficiently implemented using number-theoretic transform (NTT)-based architectures. Among these, pipelined parallel NTT-based polynomial multipliers are attractive for cloud computing as these are well suited for high-throughput and low-latency applications. For a given polynomial length, a pipelined parallel NTT-based multiplier can be designed with varying degrees of parallelism, resulting in different tradeoffs. Higher parallelism reduces latency but increases area and power consumption, and vice versa. In this paper, we develop a predictive model based on synthesized results for pipelined parallel NTT-based polynomial multipliers and analyze design tradeoffs in terms of area, power, energy, area-time product, and area-energy product across parallelism levels up to 128. We predict that, for very long polynomials, area and power differences between designs with varying levels of parallelism become negligible. In contrast, area-time product and energy per polynomial multiplication decrease with increased parallelism. Our findings suggest that, given area and power constraints, the highest feasible level of parallelism optimizes latency, area-time product, and energy per polynomial multiplication.
Full Text Available
